Abstract

Clustering partitions a collection of objects into groups called
clusters, such that similar objects fall into the same group.
Similarity between objects is defined by a distance function
satisfying the triangle inequality; this distance function along with
the collection of objects describes a distance space. In a distance
space, the only operation possible on data objects is the computation
of distance between them. All scalable algorithms in the literature
assume a special type of distance space, namely a k-dimensional vector
space, which allows vector operations on objects.

We present two scalable algorithms designed for clustering very large
datasets in distance spaces. Our first algorithm BUBBLE is, to our
knowledge, the first scalable clustering algorithm for data in a
distance space. Our second algorithm BUBBLE-FM improves upon BUBBLE by
reducing the number of calls to the distance function, which may be
computationally very expensive. Both algorithms make only a single
scan over the database while producing high clustering quality. In a
detailed experimental evaluation, we study both algorithms in terms of
scalability and quality of clustering. We also show results of
applying the algorithms to a real-life dataset.

1.  Introduction 

Data clustering is an important data mining problem
[1, 8, 9, 10, 12, 17, 21, 26]. The goal of clustering is to partition
a collection of objects into groups, called clusters, such that
“similar” objects fall into the same group. Similarity between objects
is captured by a distance function.

In this paper, we consider the problem of clustering large datasets in
a distance space in which the only operation possible on data objects
is the computation of a distance function that satisfies the triangle
inequality. In contrast, objects in a coordinate space can be
represented as vectors. The vector representation allows various
vector operations, e.g., addition and subtraction of vectors, to form
condensed representations of clusters and to reduce the time and space
requirements of the clustering problem [4, 26]. These operations are
not possible in a distance space, thus making the problem much
harder.¹

The distance function associated with a distance space can be
computationally very expensive [5], and may dominate the overall
resource requirements. For example, consider the domain of strings
where the distance between two strings is the edit distance.²
Computing the edit distance between two strings of lengths m and n
requires O(mn) comparisons between characters. In contrast, computing
the Euclidean distance between two n-dimensional vectors in a
coordinate space requires just O(n) operations. Most algorithms in the
literature have paid little attention to this particular issue when
devising clustering algorithms for data in a distance space.
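The quadratic cost of the edit distance can be seen in a minimal sketch of the usual dynamic program. It assumes unit-cost insertions, deletions, and substitutions as the "simple edit operations"; the text does not fix the operation set, and the function name is illustrative.

```python
def edit_distance(s: str, t: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute.

    The nested loops perform len(s) * len(t) character comparisons,
    which is the O(mn) cost mentioned in the text.
    """
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # distances from the empty prefix of s
    for i in range(1, m + 1):
        curr = [i] + [0] * n           # deleting i characters reaches ""
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete s[i-1]
                          curr[j - 1] + 1,     # insert t[j-1]
                          prev[j - 1] + cost)  # substitute or match
        prev = curr
    return prev[n]
```

Every call touches all m·n cells of the table, so an algorithm that issues many distance calls on long strings is dominated by this function rather than by its own bookkeeping.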

In this work, we first abstract out the essential features of the
BIRCH clustering algorithm [26] into the BIRCH* framework for scalable
clustering algorithms. We then instantiate BIRCH*, resulting in two
new scalable clustering algorithms for distance spaces: BUBBLE and
BUBBLE-FM.

The remainder of the paper is organized as follows. In Section 2, we
discuss related work on clustering and some of our initial approaches.
In Section 3, we present the BIRCH* framework for fast, scalable,
incremental clustering algorithms. In Sections 4 and 5, we instantiate
the framework for data in a distance space, resulting in our
algorithms BUBBLE and BUBBLE-FM. Section 6 evaluates the performance
of BUBBLE and BUBBLE-FM on synthetic datasets. We discuss an
application of BUBBLE-FM to a real-life dataset in Section 7 and
conclude in Section 8.

¹ A distance space is also referred to as an arbitrary metric space.
We use the term distance space to emphasize that only distance
computations are possible between objects. We call an n-dimensional
space a coordinate space to emphasize that vector operations like
centroid computation, sum, and difference of vectors are possible.

² The edit distance between two strings is the number of simple edit
operations required to transform one string into the other.

We  assume  that  the  reader  is  familiar  with  the  definitions 
of  the  following  standard  terms:  metric  space,  Lp  norm  of  a 
vector,  radius,  and  centroid  of  a  set  of  points  in  a  coordinate 
space.  (See  the  full  paper  [16]  for  the  definitions.) 

2.  Related  Work  and  Initial  Approaches 

In  this  section,  we  discuss  related  work  on  clustering, 
and  three  important  issues  that  arise  when  clustering  data 
in  a  distance  space  vis-a-vis  clustering  data  in  a  coordinate 
space. 

Data clustering has been extensively studied in the Statistics [20],
Machine Learning [12, 13], and Pattern Recognition literature [6, 7].
These algorithms assume that all the data fits into main memory, and
typically have running times super-linear in the size of the dataset.
Therefore, they do not scale to large databases.

Recently, clustering has received attention as an important data
mining problem [8, 9, 10, 17, 21, 26]. CLARANS [21] is a medoid-based
method which is more efficient than earlier medoid-based algorithms
[18], but has two drawbacks: it assumes that all objects fit in main
memory, and the result is very sensitive to the input order [26].
Techniques to improve CLARANS’s ability to deal with disk-resident
datasets by focusing only on relevant parts of the database using
R*-trees were also proposed [9, 10]. But these techniques depend on
R*-trees, which can only index vectors in a coordinate space. DBSCAN
[8] uses a density-based notion of clusters to discover clusters of
arbitrary shapes. Since DBSCAN relies on the R*-tree for speed and
scalability in its nearest neighbor search queries, it cannot cluster
data in a distance space. BIRCH [26] was designed to cluster large
datasets of n-dimensional vectors using a limited amount of main
memory. But the algorithm relies heavily on vector operations, which
are defined only in coordinate spaces. CURE [17] is a sampling-based
hierarchical clustering algorithm that is able to discover clusters of
arbitrary shapes. However, it relies on vector operations and
therefore cannot cluster data in a distance space.

Three important issues arise when clustering data in a distance space
versus data in a coordinate space. First, the concept of a centroid is
not defined. Second, the distance function could potentially be
computationally very expensive as discussed in Section 1. Third, the
domain-specific nature of clustering applications places requirements
that are difficult for a single algorithm to meet.

Many clustering algorithms [4, 17, 26] rely on vector operations,
e.g., the calculation of the centroid, to represent clusters and to
improve computation time. Such algorithms cannot cluster data in a
distance space. Thus one approach is to map all objects into a
k-dimensional coordinate space while preserving distances between
pairs of objects and then cluster the resulting vectors.


Multidimensional scaling (MDS) is a technique for distance-preserving
transformations [25]. The input to an MDS method is a set $S_{in}$ of
N objects, a distance function d, and an integer k; the output is a
set $S_{out}$ of N k-dimensional image vectors in a k-dimensional
coordinate space (also called the image space), one image vector for
each object, such that the distance between any two objects is equal
(or very close) to the distance between their respective image
vectors. MDS algorithms do not scale to large datasets for two
reasons. First, they assume that all objects fit in main memory.
Second, most MDS algorithms proposed in the literature compute
distances between all possible pairs of input objects as a first step,
thus having complexity at least $O(N^2)$ [19]. Recently, Lin et al.
developed a scalable MDS method called FastMap [11]. FastMap preserves
distances approximately in the image space while requiring only a
fixed number of scans over the data. Therefore, one possible approach
for clustering data in a distance space is to map all N objects into a
coordinate space using FastMap, and then cluster the resultant vectors
using a scalable clustering algorithm for data in a coordinate space.
We call this approach the Map-First option and empirically evaluate it
in Section 6.2. Our experiments show that the quality of clustering
thus obtained is not good.
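To make the Map-First idea concrete, the following sketch computes a single image coordinate by projecting every object onto the line through two pivot objects with the cosine law. This is the projection step commonly associated with FastMap, stated here as an assumption because the text does not describe FastMap's internals. The pivot choice, the function name, and the handling of coincident pivots are all illustrative.

```python
from typing import Callable, List, Sequence, TypeVar

Obj = TypeVar("Obj")


def pivot_coordinate(objects: Sequence[Obj],
                     dist: Callable[[Obj, Obj], float],
                     a: Obj, b: Obj) -> List[float]:
    """Project each object onto the line through pivots a and b.

    Uses only distance calls: x = (d(a,o)^2 + d(a,b)^2 - d(b,o)^2) / (2 d(a,b)).
    """
    d_ab = dist(a, b)
    if d_ab == 0:
        # Assumption: coincident pivots carry no information on this axis.
        return [0.0 for _ in objects]
    return [(dist(a, o) ** 2 + d_ab ** 2 - dist(b, o) ** 2) / (2 * d_ab)
            for o in objects]
```

Each coordinate costs two distance calls per object, so the mapping itself is cheap. The approximation error it introduces is what the Map-First experiments in Section 6.2 measure.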

Applications of clustering are domain-specific and we believe that a
single algorithm will not serve all requirements. A pre-clustering
phase, to obtain a data-dependent summarization of large amounts of
data into sub-clusters, was shown to be very effective in making more
complex data analysis feasible [4, 24, 26]. Therefore, we take the
approach of developing a pre-clustering algorithm that returns
condensed representations of sub-clusters. A domain-specific
clustering method can further analyze the sub-clusters output by our
algorithm.

3.  BIRCH* 

In this section, we present the BIRCH* framework which generalizes the
notion of a cluster feature (CF) and a CF-tree, the two building
blocks of the BIRCH algorithm [26]. In the BIRCH* family of
algorithms, objects are read from the database sequentially and
inserted into incrementally evolving clusters which are represented by
generalized cluster features (CF*s). A new object read from the
database is inserted into the closest cluster, an operation that
potentially requires an examination of all existing CF*s. Therefore
BIRCH* organizes all clusters in an in-memory index, a height-balanced
tree, called a CF*-tree. For a new object, the search for an
appropriate cluster now requires time logarithmic in the number of
clusters as opposed to a linear scan.

In  the  remainder  of  this  section,  we  abstractly  state  the 
components  of  the  BIRCH*  framework.  Instantiations  of 
these  components  generate  concrete  clustering  algorithms. 


3.1.  Generalized  Cluster  Feature 

Any clustering algorithm needs a representation for the clusters
detected in the data. The naive representation uses all objects in a
cluster. However, since a cluster corresponds to a dense region of
objects, the set of objects can be treated collectively through a
summarized representation. We will call such a condensed, summarized
representation of a cluster its generalized cluster feature (CF*).

Since the entire dataset usually does not fit in main memory, we
cannot examine all objects simultaneously to compute CF*s of clusters.
Therefore, we incrementally evolve clusters and their CF*s, i.e.,
objects are scanned sequentially and the set of clusters is updated to
assimilate new objects. Intuitively, at any stage, the next object is
inserted into the cluster “closest” to it as long as the insertion
does not deteriorate the “quality” of the cluster. (Both concepts are
explained later.) The CF* is then updated to reflect the insertion.
Since objects in a cluster are not kept in main memory, CF*s should
meet the following requirements.

•  Incremental updatability whenever a new object is inserted into the
cluster.

•  Sufficiency to compute distances between clusters, and quality
metrics (like radius) of a cluster.

CF*s are efficient for two reasons. First, they occupy much less space
than the naive representation. Second, calculation of inter-cluster
and intra-cluster measurements using the CF*s is much faster than
calculations involving all objects in clusters.

3.2.  CF*-Tree 

In this section, we describe the structure and functionality of a
CF*-tree.

A CF*-tree is a height-balanced tree structure similar to the R*-tree
[3]. The number of nodes in the CF*-tree is bounded by a pre-specified
number M. Nodes in a CF*-tree are classified into leaf and non-leaf
nodes according to their position in the tree. Each non-leaf node
contains at most B entries of the form $(CF^*_i, child_i)$,
$i \in \{1, \ldots, B\}$, where $child_i$ is a pointer to the i-th
child node, and $CF^*_i$ is the CF* of the set of objects summarized
by the sub-tree rooted at the i-th child. A leaf node contains at most
B entries, each of the form $[CF^*_i]$, $i \in \{1, \ldots, B\}$; each
leaf entry is the CF* of a cluster. Each cluster at the leaf level
satisfies a threshold requirement T, which controls its tightness or
quality.

The purpose of the CF*-tree is to direct a new object O to the cluster
closest to it. The functionality of non-leaf entries and leaf entries
in the CF*-tree is different: non-leaf entries exist to “guide” new
objects to appropriate leaf clusters, whereas leaf entries represent
the dynamically evolving clusters. For a new object O, at each
non-leaf node on the downward path, the non-leaf entry “closest” to O
is selected to traverse downwards. Intuitively, directing O to the
child node of the closest non-leaf entry is similar to identifying the
most promising region and zooming into it for a more thorough
examination. The downward traversal continues till O reaches a leaf
node. When O reaches a leaf node L, it is inserted into the cluster C
in L closest to O if the threshold requirement T is not violated due
to the insertion. Otherwise, O forms a new cluster in L. If L does not
have enough space for the new cluster, it is split into two leaf nodes
and the entries in L are redistributed: the set of leaf entries in L
is divided into two groups such that each group consists of “similar”
entries. A new entry for the new leaf node is created at its parent.
In general, all nodes on the path from the root to L may split. We
omit the details of the insertion of an object into the CF*-tree
because it is similar to that of BIRCH [26].
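The leaf-level step of the insertion can be sketched as follows. This is a simplified illustration, not the authors' procedure: a leaf is modeled as a plain list of clusters, each cluster keeps only a representative object and a count, the representative is never updated, and the actual node split is left to the caller. All names are hypothetical.

```python
from typing import Any, Callable, Dict, List


def insert_into_leaf(leaf: List[Dict[str, Any]], obj: Any,
                     dist: Callable[[Any, Any], float],
                     threshold: float, capacity: int) -> str:
    """Insert obj into a leaf; return 'absorbed', 'new', or 'split'.

    'split' means the leaf now holds more than `capacity` clusters and
    the caller would have to divide it into two leaf nodes.
    """
    if leaf:
        # Find the cluster whose representative is closest to obj.
        closest = min(leaf, key=lambda c: dist(c["clustroid"], obj))
        if dist(closest["clustroid"], obj) <= threshold:
            closest["n"] += 1          # threshold T not violated: absorb
            return "absorbed"
    # Otherwise obj starts a new cluster in this leaf.
    leaf.append({"clustroid": obj, "n": 1})
    return "split" if len(leaf) > capacity else "new"
```

A small usage example with numbers and absolute difference as the distance:

```python
leaf = []
d = lambda u, v: abs(u - v)
print(insert_into_leaf(leaf, 0.0, d, threshold=1.0, capacity=2))
print(insert_into_leaf(leaf, 0.5, d, threshold=1.0, capacity=2))
```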

During the data scan, existing clusters are updated and new clusters
are formed. The number of nodes in the CF*-tree may increase beyond M
before the data scan is complete due to the formation of many new
clusters. Then it is necessary to reduce the space occupied by the
CF*-tree, which can be done by reducing the number of clusters it
maintains. The reduction in the number of clusters is achieved by
merging close clusters to form bigger clusters. BIRCH* merges clusters
by increasing the threshold value T associated with the leaf clusters
and re-inserting them into a new tree. The re-insertion of a leaf
cluster into the new tree merely inserts its CF*; all objects in leaf
clusters are treated collectively. Thus a new, smaller CF*-tree is
built. After all the old leaf entries have been inserted into the new
tree, the data scan resumes from the point of interruption.

Note that the CF*-tree insertion algorithm requires distance measures
between the “inserted entries” and node entries to select the closest
entry at each level. Since insertions are of two types, insertion of a
single object and that of a leaf cluster, the BIRCH* framework
requires distance measures to be instantiated between a CF* and an
object, and between two CF*s (or clusters).

In  summary,  CF*s,  their  incremental  maintenance,  the 
distance  measures,  and  the  threshold  requirement  are  the 
components  of  the  BIRCH*  framework,  which  have  to  be 
instantiated  to  derive  a  concrete  clustering  algorithm. 
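One way to picture the components that an instantiation must supply is an abstract interface. The sketch below is only an illustration of the framework's contract; the class and method names are hypothetical and do not come from the paper.

```python
from abc import ABC, abstractmethod
from typing import Any


class GeneralizedClusterFeature(ABC):
    """Hypothetical contract for a CF* in a BIRCH*-style algorithm."""

    @abstractmethod
    def insert_object(self, obj: Any) -> None:
        """Incrementally update the summary when one object is added."""

    @abstractmethod
    def merge(self, other: "GeneralizedClusterFeature") -> None:
        """Incrementally update the summary when a whole cluster is added."""

    @abstractmethod
    def distance_to_object(self, obj: Any) -> float:
        """Distance measure between this CF* and a single object."""

    @abstractmethod
    def distance_to_cluster(self, other: "GeneralizedClusterFeature") -> float:
        """Distance measure between two CF*s."""

    @abstractmethod
    def satisfies_threshold(self, obj: Any, threshold: float) -> bool:
        """Whether inserting obj keeps the cluster within threshold T."""
```

A concrete algorithm such as BUBBLE would then correspond to one subclass for leaf entries and another for non-leaf entries, since the two kinds of CF* serve different purposes.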

4.  BUBBLE 

In this section, we instantiate BIRCH* for data in a distance space,
resulting in our first algorithm called BUBBLE. Recall that CF*s at
leaf and non-leaf nodes differ in their functionality. The former
incrementally maintain information about the output clusters, whereas
the latter are used to direct new objects to appropriate leaf
clusters. Sections 4.1 and 4.2 describe the information in a CF* (and
its incremental maintenance) at the leaf and non-leaf levels.


4.1.  CF*s  at  the  leaf  level 

4.1.1  Summary  statistics  at  the  leaf  level 

For each cluster discovered by the algorithm, we return the following
information (which is used in further processing): the number of
objects in the cluster, a centrally located object in it, and its
radius. Since a distance space, in general, does not support creation
of new objects using operations on a set of objects, we assign an
actual object in the cluster as the cluster center. We define the
clustroid $\hat{O}$ of a set of objects $\mathcal{O}$, which is the
generalization of the centroid to a distance space.³ We now introduce
the RowSum of an object $O$ with respect to a set of objects
$\mathcal{O}$, and the concept of an image space $IS(\mathcal{O})$ of
a set of objects $\mathcal{O}$ in a distance space. Informally, the
image space of a set of objects is a coordinate space containing an
image vector for each object such that the distance between any two
image vectors is the same as the distance between the corresponding
objects.

In the remainder of this section, we use $(S, d)$ to denote a distance
space where $S$ is the domain of all possible objects and
$d : S \times S \mapsto \mathbb{R}$ is a distance function.

Definition 4.1 Let $\mathcal{O} = \{O_1, \ldots, O_n\}$ be a set of
objects in a distance space $(S, d)$. The RowSum of an object
$O \in \mathcal{O}$ is defined as
$RowSum(O) \stackrel{def}{=} \sum_{j=1}^{n} d^2(O, O_j)$.
The clustroid $\hat{O}$ is defined as the object
$\hat{O} \in \mathcal{O}$ such that
$\forall O \in \mathcal{O} : RowSum(\hat{O}) \le RowSum(O)$.
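Definition 4.1 translates directly into code when all objects of the set are available. The sketch below is a brute-force illustration that needs a quadratic number of distance calls; the function names are illustrative.

```python
from typing import Any, Callable, Sequence


def row_sum(obj: Any, objects: Sequence[Any],
            dist: Callable[[Any, Any], float]) -> float:
    """Sum of squared distances from obj to every object in the set."""
    return sum(dist(obj, other) ** 2 for other in objects)


def clustroid(objects: Sequence[Any],
              dist: Callable[[Any, Any], float]) -> Any:
    """The object with the minimum RowSum (first one wins on ties)."""
    return min(objects, key=lambda o: row_sum(o, objects, dist))
```

For the numbers 0, 1, 2, 10 under absolute difference, the RowSum values are 105, 83, 69, and 245, so the clustroid is 2. It is also the member nearest to the arithmetic mean 3.25.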

Definition 4.2 Let $\mathcal{O} = \{O_1, \ldots, O_n\}$ be a set of
objects in a distance space $(S, d)$. Let
$f : \mathcal{O} \mapsto \mathbb{R}^k$ be a function. We call $f$ an
$\mathbb{R}^k$-distance-preserving transformation if
$\forall i, j \in \{1, \ldots, n\} : d(O_i, O_j) = \|f(O_i) - f(O_j)\|$,
where $\|X - Y\|$ is the Euclidean distance between $X$ and $Y$ in
$\mathbb{R}^k$. We call $\mathbb{R}^k$ the image space of
$\mathcal{O}$ under $f$ (denoted $IS_f(\mathcal{O})$). For an object
$O \in \mathcal{O}$, we call $f(O)$ the image vector of $O$ under $f$.
We define $f(\mathcal{O}) \stackrel{def}{=} \{f(O_1), \ldots, f(O_n)\}$.

The  existence  of  a  distance-preserving  transformation  is 
guaranteed  by  the  following  lemma. 

Lemma 4.1 [19] Let $\mathcal{O}$ be a set of objects in a distance
space $(S, d)$. Then there exists a positive integer $k$
($k < |\mathcal{O}|$) and a function
$f : \mathcal{O} \mapsto \mathbb{R}^k$ such that $f$ is an
$\mathbb{R}^k$-distance-preserving transformation.

³ The medoid $O_m$ of a set of objects $\mathcal{O}$ is sometimes used
as a cluster center [18]. It is defined as the object
$O_m \in \mathcal{O}$ that minimizes the average dissimilarity to all
objects in $\mathcal{O}$ (i.e., $\sum_{i=1}^{n} d(O_i, O)$ is minimum
when $O = O_m$). But it is not possible to motivate the heuristic
maintenance, à la clustroid, of the medoid. However, we expect similar
heuristics to work even for the medoid.

For example, three objects x, y, z with the inter-object distance
distribution [d(x, y) = 3, d(y, z) = 4, d(z, x) = 5] can be mapped to
vectors in the 2-dimensional Euclidean space: y to (0, 0), x to
(3, 0), and z to (0, 4). This is one of many possible mappings.
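The three-object example can be checked mechanically against Definition 4.2. The sketch below assumes the assignment y to (0, 0), x to (3, 0), and z to (0, 4), which is the assignment of the listed vectors under which all three distances match; the helper name is illustrative.

```python
import math
from typing import Dict, Tuple

Vector = Tuple[float, float]


def is_distance_preserving(required: Dict[Tuple[str, str], float],
                           image: Dict[str, Vector],
                           tol: float = 1e-9) -> bool:
    """True if every required distance equals the Euclidean distance
    between the corresponding image vectors, up to tol."""
    for (a, b), d in required.items():
        (ax, ay), (bx, by) = image[a], image[b]
        if abs(math.hypot(ax - bx, ay - by) - d) > tol:
            return False
    return True


required = {("x", "y"): 3.0, ("y", "z"): 4.0, ("z", "x"): 5.0}
image = {"y": (0.0, 0.0), "x": (3.0, 0.0), "z": (0.0, 4.0)}
```

Swapping the images of x and y would break the mapping, since z would then be at distance 4 from x instead of 5.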

The following lemma shows that under any
$\mathbb{R}^k$-distance-preserving transformation $f$, the clustroid
of $\mathcal{O}$ is the object $O \in \mathcal{O}$ whose image vector
$f(O)$ is closest to the centroid of the set of image vectors
$f(\mathcal{O})$. Thus, the clustroid is the generalization of the
centroid to distance spaces. Following the generalization of the
centroid, we generalize the definitions of the radius of a cluster,
and the distance between clusters to distance spaces.

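Lemma 4.2 Let $\mathcal{O} = \{O_1, \ldots, O_n\}$ be a set of objects
in a distance space $(S, d)$ with clustroid $\hat{O}$ and let
$f : \mathcal{O} \mapsto \mathbb{R}^k$ be an
$\mathbb{R}^k$-distance-preserving transformation. Let $\overline{O}$
be the centroid of $f(\mathcal{O})$. Then the following holds:

$\forall O \in \mathcal{O} : \|f(\hat{O}) - \overline{O}\| \le \|f(O) - \overline{O}\|$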
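The reason the minimum-RowSum object is also the object nearest the centroid can be made explicit with a short expansion. The derivation below is a sketch that is not taken from the text; it uses only the distance-preserving property of $f$ and writes $\overline{O}$ for the centroid of the image vectors.

```latex
\begin{align*}
RowSum(O)
  &= \sum_{j=1}^{n} \| f(O) - f(O_j) \|^2 \\
  &= \sum_{j=1}^{n} \| (f(O) - \overline{O}) - (f(O_j) - \overline{O}) \|^2 \\
  &= n \| f(O) - \overline{O} \|^2
     - 2\,(f(O) - \overline{O}) \cdot \sum_{j=1}^{n} (f(O_j) - \overline{O})
     + \sum_{j=1}^{n} \| f(O_j) - \overline{O} \|^2 \\
  &= n \| f(O) - \overline{O} \|^2
     + \sum_{j=1}^{n} \| f(O_j) - \overline{O} \|^2 .
\end{align*}
```

The cross term vanishes because the deviations of the image vectors from their centroid sum to zero. The remaining sum does not depend on $O$, so minimizing $RowSum(O)$ over the set is the same as minimizing $\|f(O) - \overline{O}\|$.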

Definition 4.3 Let $\mathcal{O} = \{O_1, \ldots, O_n\}$ be a set of
objects in a distance space $(S, d)$ with clustroid $\hat{O}$. The
radius $r(\mathcal{O})$ of $\mathcal{O}$ is defined as
$r(\mathcal{O}) \stackrel{def}{=} \sqrt{\frac{\sum_{i=1}^{n} d^2(O_i, \hat{O})}{n}}$.

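Definition 4.4 We define two different inter-cluster distance metrics
between cluster features. Let $\mathcal{O}_1$ and $\mathcal{O}_2$ be
two clusters consisting of objects $\{O_{11}, \ldots, O_{1n_1}\}$ and
$\{O_{21}, \ldots, O_{2n_2}\}$. Let their clustroids be $\hat{O}_1$
and $\hat{O}_2$ respectively. We define the clustroid distance $D_0$
as $D_0(\mathcal{O}_1, \mathcal{O}_2) \stackrel{def}{=} d(\hat{O}_1, \hat{O}_2)$
and the average inter-cluster distance $D_2$ as
$D_2(\mathcal{O}_1, \mathcal{O}_2) \stackrel{def}{=} \left( \frac{\sum_{i=1}^{n_1} \sum_{j=1}^{n_2} d^2(O_{1i}, O_{2j})}{n_1 n_2} \right)^{1/2}$.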

Both BUBBLE and BUBBLE-FM use $D_0$ as the distance metric between
leaf level clusters, and as the threshold requirement T, i.e., a new
object $O_{new}$ is inserted into a cluster $\mathcal{O}$ with
clustroid $\hat{O}$ only if
$D_0(\mathcal{O}, \{O_{new}\}) = d(\hat{O}, O_{new}) \le T$. (The use
for $D_2$ is explained later.)
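Definitions 4.3 and 4.4 and the threshold test can be written down as a brute-force sketch for clusters that are fully available in memory. The function names are illustrative, and the non-strict comparison with T is an assumption about how the threshold is applied.

```python
import math
from typing import Any, Callable, Sequence

Dist = Callable[[Any, Any], float]


def radius(cluster: Sequence[Any], clustroid: Any, dist: Dist) -> float:
    """Root mean squared distance of the members to the clustroid."""
    return math.sqrt(sum(dist(o, clustroid) ** 2 for o in cluster) / len(cluster))


def d0(clustroid1: Any, clustroid2: Any, dist: Dist) -> float:
    """Clustroid distance: one distance call per pair of clusters."""
    return dist(clustroid1, clustroid2)


def d2(cluster1: Sequence[Any], cluster2: Sequence[Any], dist: Dist) -> float:
    """Average inter-cluster distance: n1 * n2 distance calls."""
    total = sum(dist(a, b) ** 2 for a in cluster1 for b in cluster2)
    return math.sqrt(total / (len(cluster1) * len(cluster2)))


def within_threshold(clustroid: Any, new_obj: Any, dist: Dist, t: float) -> bool:
    """Threshold requirement for absorbing new_obj into a leaf cluster."""
    return d0(clustroid, new_obj, dist) <= t
```

The contrast in cost is the practical point: $D_0$ needs a single distance computation, whereas $D_2$ grows with the product of the cluster sizes.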

4.1.2  Incremental maintenance of leaf-level CF*s

In this section, we describe the incremental maintenance of CF*s at
the leaf levels of the CF*-tree. Since the sets of objects we are
concerned with in this section are clusters, we use $C$ (instead of
$\mathcal{O}$) to denote a set of objects.

The incremental maintenance of the number of objects in a cluster $C$
is trivial. So we concentrate next on the incremental maintenance of
the clustroid $\hat{C}$. Recall that for a cluster $C$, $\hat{C}$ is
the object in $C$ with the minimum RowSum value. As long as we are
able to keep all the objects of $C$ in main memory, we can maintain
$\hat{C}$ incrementally under insertions by updating the RowSum values
of all objects $O \in C$ and then selecting the object with minimum
RowSum value as the clustroid. But this strategy requires all objects
in $C$ in main memory, which is not a viable option for large
datasets. Since exact maintenance is not possible, we develop a
heuristic strategy which works well in practice while significantly
reducing main memory requirements. We classify insertions in clusters
into two types, Type I and Type II, and motivate heuristics for each
type of insertion. A Type I insertion is the insertion of a single
object or, equivalently, a cluster containing only one object. Each
object in the dataset causes a Type I insertion when it is read from
the data file, making it the most common type of insertion. A Type II
insertion is the insertion of a cluster containing more than one
object. Type II insertions occur only when the CF*-tree is being
rebuilt. (See Section 3.)

Type I Insertions: In our heuristics, we make the following
approximation: under any distance-preserving transformation $f$ into a
coordinate space, the image vector of the clustroid is the centroid of
the set of all image vectors, i.e., $f(\hat{C}) = \overline{f(C)}$.
From Lemma 4.2, we know that this is the best possible approximation.
In addition to the approximation, our heuristic is motivated by the
following two observations.

Observation 1: Consider the insertion of a new object $O_{new}$ into a
cluster $C = \{O_1, \ldots, O_n\}$ and assume that only a subset
$R \subset C$ is kept in main memory. Let
$f : C \cup \{O_{new}\} \mapsto \mathbb{R}^k$ be an
$\mathbb{R}^k$-distance-preserving transformation from
$C \cup \{O_{new}\}$ into $\mathbb{R}^k$, and let
$\overline{f(C)} = \frac{\sum_{i=1}^{n} f(O_i)}{n}$ be the centroid of
$f(C)$. Then,

$RowSum(O_{new}) = \sum_{j=1}^{n} d^2(O_{new}, O_j) = \sum_{j=1}^{n} (f(O_{new}) - f(O_j))^2$

$= \sum_{j=1}^{n} (f(O_j) - \overline{f(C)})^2 + n (\overline{f(C)} - f(O_{new}))^2$

$\approx n r^2(C) + n d^2(\hat{C}, O_{new})$

Thus, we can calculate $RowSum(O_{new})$ approximately using only
$\hat{C}$ (together with $n$ and $r(C)$, which are kept in the CF*)
and significantly reduce the main memory requirements.
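The approximation in Observation 1 needs only three stored quantities and one distance call. A minimal sketch, with an illustrative function name:

```python
def approx_row_sum(n: int, radius: float, dist_to_clustroid: float) -> float:
    """Estimate RowSum of a new object as n * r^2 + n * d^2.

    n                 -- number of objects already in the cluster
    radius            -- r(C), the stored radius of the cluster
    dist_to_clustroid -- distance from the clustroid to the new object
    """
    return n * radius ** 2 + n * dist_to_clustroid ** 2
```

As a check of the formula, take the cluster {0, 2, 4} under absolute difference: the clustroid 2 coincides with the centroid and $r^2 = 8/3$. For the new object 5 the estimate is $3 \cdot 8/3 + 3 \cdot 9 = 35$, which equals the exact value $25 + 9 + 1$. When the clustroid's image is not exactly the centroid, the estimate is only approximate.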

Observation 2: Let $C = \{O_1, \ldots, O_n\}$ be a leaf cluster in the
CF*-tree and $O_{new}$ an object which is inserted into $C$. Let
$\hat{C}$ and $\hat{C}^*$ be the clustroids of $C$ and
$C \cup \{O_{new}\}$ respectively. Let $D_0$ be the distance metric
between leaf clusters and $T$ the threshold requirement of the
CF*-tree under $D_0$. Then

$d(\hat{C}, \hat{C}^*) \le \epsilon$, where $\epsilon = \frac{T}{(n+1)}$.

An implication of Observation 2 is that as long as we keep a set
$R \subset C$ of objects consisting of all objects in $C$ within a
distance of $\epsilon$ from $\hat{C}$, we know that
$\hat{C}^* \in R$. However, when the clustroid changes due to the
insertion of $O_{new}$, we have to update $R$ to consist of all
objects within $\epsilon$ of $\hat{C}^*$. Since we cannot assume that
all objects in the dataset fit in main memory, we have to retrieve
objects in $C$ from the disk. Repeated retrieval of objects from the
disk, whenever a clustroid changes, is expensive. Fortunately (from
Observation 2), if $n$ is large then the new set of objects within
$\epsilon$ of $\hat{C}^*$ is almost the same as the old set $R$
because $\hat{C}^*$ is very close to $\hat{C}$.

Observations 1 and 2 motivate the following heuristic maintenance of
the clustroid. As long as $|C|$ is small (smaller than a constant
$p$), we keep all the objects of $C$ in main memory and compute the
new clustroid exactly. If $|C|$ is large (larger than $p$), we invoke
Observation 2 and maintain a subset of $C$ of size $p$. These $p$
objects have the lowest RowSum values in $C$ and hence are closest to
$\hat{C}$. If the RowSum value of $O_{new}$ is less than the highest
of the $p$ values, say that of $O_p$, then $O_{new}$ replaces $O_p$ in
$R$. Our experiments confirm that this heuristic works very well in
practice.
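The Type I heuristic can be sketched as a small class that keeps at most p representatives with their RowSum values. This is an illustration under stated assumptions, not the authors' implementation: the RowSum of each retained object is updated by adding its squared distance to the new object, the estimate from Observation 1 is written as RowSum(clustroid) + n·d², which equals n·r² + n·d² by Definition 4.3, and all names are hypothetical.

```python
from typing import Any, Callable, List


class LeafCluster:
    """Hypothetical leaf CF* holding up to p low-RowSum representatives."""

    def __init__(self, first_obj: Any,
                 dist: Callable[[Any, Any], float], p: int) -> None:
        self.dist = dist
        self.p = p
        self.n = 1
        self.reps: List[list] = [[first_obj, 0.0]]  # [object, RowSum] pairs

    def clustroid_entry(self) -> list:
        """Retained entry with the smallest RowSum."""
        return min(self.reps, key=lambda e: e[1])

    def insert(self, new_obj: Any) -> None:
        clustroid, clustroid_row_sum = self.clustroid_entry()
        sq = [self.dist(o, new_obj) ** 2 for o, _ in self.reps]
        if len(self.reps) == self.n:
            # Every object is still retained, so the RowSum is exact.
            new_row_sum = sum(sq)
        else:
            # Observation 1: n * r^2 + n * d^2(clustroid, new_obj).
            new_row_sum = (clustroid_row_sum
                           + self.n * self.dist(clustroid, new_obj) ** 2)
        # Assumption: retained RowSums grow by the distance to new_obj.
        for entry, s in zip(self.reps, sq):
            entry[1] += s
        if len(self.reps) < self.p:
            self.reps.append([new_obj, new_row_sum])
        else:
            worst = max(self.reps, key=lambda e: e[1])
            if new_row_sum < worst[1]:
                worst[0], worst[1] = new_obj, new_row_sum
        self.n += 1
```

With p = 3 and the stream 0, 1, 2, 10 under absolute difference, the first three objects are kept exactly, the outlier 10 is rejected as a representative, and the retained clustroid is 2, which matches the exact clustroid of the four objects.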

Type II Insertions: Let $C_1 = \{O_{11}, \ldots, O_{1n_1}\}$ and
$C_2 = \{O_{21}, \ldots, O_{2n_2}\}$ be two clusters being merged. Let
$f$ be a distance-preserving transformation of $C_1 \cup C_2$ into
$\mathbb{R}^k$. Let $X_{ij}$ be the image vector in
$IS(C_1 \cup C_2)$ of object $O_{ij}$ under $f$. Let $\overline{X}_1$
($\overline{X}_2$) be the centroid of $\{X_{11}, \ldots, X_{1n_1}\}$
($\{X_{21}, \ldots, X_{2n_2}\}$). Let $\hat{C}_1$, $\hat{C}_2$ be
their clustroids, and $r(C_1)$, $r(C_2)$ be their radii. Let
$\hat{C}^*$ be the clustroid of $C_1 \cup C_2$.

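The new centroid $\overline{X}$ of $f(C_1 \cup C_2)$ lies on the line
joining $\overline{X}_1$ and $\overline{X}_2$; its exact location on
the line depends on the values of $n_1$ and $n_2$, since
$\overline{X} = \frac{n_1 \overline{X}_1 + n_2 \overline{X}_2}{n_1 + n_2}$.
Since $D_0$ is used as the threshold requirement for insertions, the
distance between $\overline{X}$ and $\overline{X}_1$ is bounded as
shown below:

$(\overline{X} - \overline{X}_1)^2 = \frac{n_2^2 (\overline{X}_1 - \overline{X}_2)^2}{(n_1 + n_2)^2} \approx \frac{n_2^2 d^2(\hat{C}_1, \hat{C}_2)}{(n_1 + n_2)^2} \le \frac{n_2^2 T^2}{(n_1 + n_2)^2}$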

The  following  two  assumptions  motivate  the  heuristic 
maintenance  of  the  clustroid  under  Type  II  insertions. 

(i) $C_1$ and $C_2$ are non-overlapping but very close to each other.
Since $C_1$ and $C_2$ are being merged, the threshold criterion is
satisfied, implying that $C_1$ and $C_2$ are close to each other. We
expect the two clusters to be almost non-overlapping because they were
two distinct clusters in the old CF*-tree.

(ii) $n_1 \approx n_2$. Due to lack of any prior information about the
clusters, we assume that the objects are uniformly distributed in the
merged cluster. Therefore, the values of $n_1$ and $n_2$ are close to
each other in Type II insertions.

For these two reasons, we expect the new clustroid $\hat{C}^*$ to be
midway between $C_1$ and $C_2$, which corresponds to the periphery of
either cluster. Therefore we maintain a few objects ($p$ in number) on
the periphery of each cluster in its CF*. Because they are the
farthest objects from the clustroid, they have the highest RowSum
values in their respective clusters.

Thus we overall maintain $2p$ objects for each leaf cluster $C$, which
we call the representative objects of $C$; the value $2p$ is called
the representation number of $C$. Storing the representative objects
enables the approximate incremental maintenance of the clustroid. The
incremental maintenance of the radius of $C$ is similar to that of
RowSum values; details are given in the full paper [16].

Summarizing, we maintain the following information in the CF* of a
leaf cluster $C$: (i) the number of objects in $C$, (ii) the clustroid
of $C$, (iii) $2p$ representative objects, (iv) the RowSum values of
the representative objects, and (v) the radius of the cluster. All
these statistics are incrementally maintainable, as described above,
as the cluster evolves.

4.2.  CF*s  at  Non-leaf  Level 

In this section, we instantiate the cluster features at non-leaf
levels of the BIRCH* framework and describe their incremental
maintenance.

4.2.1  Sample  Objects 

In the BIRCH* framework, the functionality of a CF* at a non-leaf
entry is to guide a new object to the sub-tree which contains its
prospective cluster. Therefore, the cluster feature of the i-th
non-leaf entry $NL_i$ of a non-leaf node NL summarizes the
distribution of all clusters in the subtree rooted at $NL_i$. In
Algorithm BUBBLE, this summary, the CF*, is represented by a set of
objects; we call these objects the sample objects $S(NL_i)$ of $NL_i$
and the union of all sample objects at all the entries the sample
objects $S(NL)$ of NL.

We now describe the procedure for selecting the sample
objects. Let $child_1, \ldots, child_k$ be the child nodes at
NL with $n_1, \ldots, n_k$ entries respectively. Let $S(NL_i)$ denote
the set of sample objects collected from $child_i$ and associated
with $NL_i$. S(NL) is the union of sample objects
at all entries of NL. The number of sample objects to be
collected at any non-leaf node is upper bounded by a constant
called the sample size (SS). The number $|S(NL_i)|$ contributed
by $child_i$ is $\max\left(\frac{n_i \cdot SS}{\sum_{j=1}^{k} n_j},\, 1\right)$. The restriction
that each child node have at least one representative in
S(NL) is placed so that the distribution of the sample objects
is representative of all its children, and is also necessary
to define distance measures between a newly inserted
object and a non-leaf cluster. If $child_i$ is a leaf node, then
the sample objects $S(NL_i)$ are randomly picked from all the
clustroids of the leaf clusters at $child_i$. Otherwise, they are
randomly picked from $child_i$'s sample objects $S(child_i)$.
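
The allocation rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, rounding the proportional share down to an integer is our assumption, and each child is represented simply by its pool of candidate objects (clustroids for a leaf child, $S(child_i)$ otherwise).

```python
import random
from typing import Any, List, Optional, Sequence, Tuple


def sample_counts(entry_counts: Sequence[int], sample_size: int) -> List[int]:
    """Proportional share of the sample size per child, at least 1 each.

    Rounding down is an assumption made for this sketch.
    """
    total = sum(entry_counts)
    return [max((n_i * sample_size) // total, 1) for n_i in entry_counts]


def collect_samples(children: Sequence[Tuple[int, Sequence[Any]]],
                    sample_size: int,
                    rng: Optional[random.Random] = None) -> List[List[Any]]:
    """children[i] = (n_i, candidate pool of child_i).

    Returns one list of sample objects per non-leaf entry.
    """
    rng = rng or random.Random()
    counts = sample_counts([n for n, _ in children], sample_size)
    samples = []
    for (_, pool), count in zip(children, counts):
        # Never ask for more objects than the pool holds.
        samples.append(rng.sample(list(pool), min(count, len(pool))))
    return samples
```

With entry counts 10, 5 and 1 and a sample size of 8, the shares are 5, 2 and 1: the smallest child would receive 0 by proportion alone, and the lower bound of 1 keeps it represented.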

4.2.2  Updates  to  Sample  Objects 

The CF*-tree evolves gradually as new objects are inserted
into it. The accuracy of the summary distribution captured
by sample objects at a non-leaf entry depends on how recently
the sample objects were gathered. The periodicity of
updates to these samples, and when these updates are actually
triggered, affect the currency of the samples. Each
time we update the sample objects we incur a certain cost.
Thus we have to strike a balance between the cost of updating
the sample objects and their currency.

Because a split at $child_i$ of NL causes redistribution of
its entries between $child_i$ and the new node $child_{k+1}$, we
have to update samples $S(NL_i)$ and $S(NL_{k+1})$ at entries $NL_i$
and $NL_{k+1}$ of the parent (we actually create samples for the
new entry $NL_{k+1}$). However, to reflect changes in the distributions
at all children nodes we update the sample objects
at all entries of NL whenever one of its children splits.

4.2.3  Distance  measures  at  non-leaf  levels 

Let $O_{new}$ be a new object inserted into the CF*-tree.
The distance between $O_{new}$ and $NL_i$ is defined to be
$D_2(\{O_{new}\}, S(NL_i))$. Since $D_2(\{O_{new}\}, \emptyset)$ is meaningless,
we ensure that each non-leaf entry has at least one
sample object from its child during the selection of sample
objects. Let $L_i$ represent the $i$th leaf entry of a leaf node L.
The distance between $C$ and $L_i$ is defined to be the clustroid
distance $D_0(C, L_i)$.
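
The non-leaf distance can be sketched in a few lines. The sketch assumes that $D_2$ is the average inter-cluster distance used in BIRCH, i.e., the square root of the mean squared pairwise distance between the two sets; that definition is not restated in this section, so treat it as an assumption. The function names are hypothetical.

```python
import math
from typing import Any, Callable, Sequence


def d2(cluster_a: Sequence[Any], cluster_b: Sequence[Any],
       d: Callable[[Any, Any], float]) -> float:
    """Assumed D2: root of the mean squared pairwise distance."""
    if not cluster_a or not cluster_b:
        # D2 against an empty set is meaningless, as noted in the text.
        raise ValueError("D2 is undefined for an empty set of objects")
    total = sum(d(a, b) ** 2 for a in cluster_a for b in cluster_b)
    return math.sqrt(total / (len(cluster_a) * len(cluster_b)))


def closest_entry(o_new: Any, entry_samples: Sequence[Sequence[Any]],
                  d: Callable[[Any, Any], float]) -> int:
    """Index of the non-leaf entry whose sample objects are nearest to o_new."""
    return min(range(len(entry_samples)),
               key=lambda i: d2([o_new], entry_samples[i], d))
```

Each call to `closest_entry` invokes `d` once per sample object at the node, which is the cost that BUBBLE-FM sets out to reduce.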

The instantiation of distance measures completes the instantiation
of BIRCH*, yielding BUBBLE. We omit the
cost analysis of BUBBLE because it is similar to that of
BIRCH.

5.  BUBBLE-FM 

While inserting a new object $O_{new}$, BUBBLE computes
distances between $O_{new}$ and all the sample objects at each
non-leaf node on its downward path from the root to a leaf
node. The distance function $d$ may be computationally very
expensive (e.g., the edit distance on strings). We address
this issue in our second algorithm, BUBBLE-FM, which
improves upon BUBBLE by using FastMap [11] to reduce
the number of invocations of $d$. We first give a brief
overview of FastMap and then describe BUBBLE-FM.

5.1.  Overview  of  FastMap 

Given a set $\mathcal{O}$ of $N$ objects, a distance function $d$, and an
integer $k$, FastMap quickly (in time linear in $N$) computes
$N$ vectors (called image vectors), one for each object, in a
$k$-dimensional Euclidean image space such that the distance
between two image vectors is close to the distance between
the corresponding two objects. Thus, FastMap is an “approximate”
$\mathcal{R}^k$-distance-preserving transformation. Each
of the $k$ axes is defined by the line joining two objects.4
The $2k$ objects are called pivot objects. The space defined
by the $k$ axes is the fastmapped image space $IS_{fm}(\mathcal{O})$ of
$\mathcal{O}$. The number of calls to $d$ made by FastMap to map $N$
objects is $3Nkc$, where $c$ is a parameter (typically set to 1
or 2).

4 See Lin et al. for details [11].

An important feature of FastMap that we use in
BUBBLE-FM is its fast incremental mapping ability. Given
a new object $O_{new}$, FastMap projects it onto the $k$ coordinate
axes of $IS_{fm}(\mathcal{O})$ to compute a $k$-dimensional vector
for $O_{new}$ in $IS_{fm}(\mathcal{O})$ with just $2k$ calls to $d$. Distance between
$O_{new}$ and any object $O \in \mathcal{O}$ can now be measured
through the Euclidean distance between their image vectors.
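
The incremental mapping can be sketched as below. This is an illustrative reading of FastMap's projection step rather than code from [11]: for axis $h$ with pivots $a$ and $b$, the coordinate is $(d_h^2(O,a) + d_h^2(a,b) - d_h^2(O,b)) / (2\,d_h(a,b))$, where $d_h$ is the distance left over after removing the first $h$ coordinates. The tuple layout of `pivots` and the function names are our assumptions.

```python
import math
from typing import Any, Callable, List, Sequence, Tuple

# One entry per axis: (pivot_a, pivot_b, image_of_a, image_of_b, d(a, b)).
Pivot = Tuple[Any, Any, Sequence[float], Sequence[float], float]


def _residual_sq(dist: float, u: Sequence[float], v: Sequence[float],
                 h: int) -> float:
    """Squared distance left after removing the first h image coordinates."""
    r = dist * dist - sum((u[j] - v[j]) ** 2 for j in range(h))
    return max(r, 0.0)  # guard against small negative values


def map_new_object(o_new: Any, pivots: Sequence[Pivot],
                   d: Callable[[Any, Any], float]) -> List[float]:
    """Project o_new onto each pivot axis; uses exactly 2 calls to d per axis."""
    image: List[float] = []
    for h, (a, b, img_a, img_b, d_ab) in enumerate(pivots):
        da_sq = _residual_sq(d(o_new, a), image, img_a, h)
        db_sq = _residual_sq(d(o_new, b), image, img_b, h)
        dab_sq = _residual_sq(d_ab, img_a, img_b, h)
        if dab_sq == 0.0:
            image.append(0.0)  # degenerate axis: pivots coincide
        else:
            # Cosine-law projection onto the line through the two pivots.
            image.append((da_sq + dab_sq - db_sq) / (2.0 * math.sqrt(dab_sq)))
    return image
```

The stored pivot distances and pivot image vectors are reused, so the only new calls to `d` are the two per axis, $2k$ in total.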

5.2.  Description  of  BUBBLE-FM 

BUBBLE-FM  differs  from  BUBBLE  only  in  its  usage 
of  sample  objects  at  a  non-leaf  node.  In  BUBBLE-FM,  we 
first  map — using  FastMap — the  set  of  all  sample  objects  at  a 
non-leaf  node  into  an  “approximate”  image  space.  We  then 
use  the  image  space  to  measure  distances  between  an  in¬ 
coming  object  and  the  CF*s.  Since  CF*s  at  non-leaf  entries 
function  merely  as  guides  to  appropriate  children  nodes,  an 
approximate  image  space  is  sufficient.  We  now  describe  the 
construction  of  the  image  space  and  its  usage  in  detail. 

Consider a non-leaf node NL. Whenever S(NL) is updated,
we use FastMap to map S(NL) into a $k$-dimensional
coordinate space $IS_{fm}(NL)$; $k$ is called the image dimensionality
of NL. FastMap returns a vector for each object in
S(NL). The centroid of the image vectors of $S(NL_i)$ is then
used as the centroid of the cluster represented by $NL_i$ when
defining distance measures.

Let $fm : S(NL) \rightarrow IS_{fm}(NL)$ be the distance-preserving
transformation associated with FastMap that maps
each sample object $s \in S(NL)$ to a $k$-dimensional vector
$fm(s) \in IS_{fm}(NL)$. Let $\bar{S}(NL_i)$ be the centroid
of the set of image vectors of $S(NL_i)$, i.e.,

$$\bar{S}(NL_i) = \frac{\sum_{s \in S(NL_i)} fm(s)}{|S(NL_i)|}$$

The non-leaf CF* in BUBBLE-FM consists of (1) $S(NL_i)$
and (2) $\bar{S}(NL_i)$. In addition, we maintain the image vectors
of the $2k$ pivot objects returned by FastMap.

The $2k$ pivot objects define the axes of the $k$-dimensional
image space constructed by FastMap. Let $O_{new}$ be a new
object. Using FastMap, we incrementally map $O_{new}$ to
$V_{new} \in IS_{fm}(NL)$. We define the distance between $O_{new}$
and $NL_i$ to be the Euclidean distance between $V_{new}$ and
the centroid $\bar{S}(NL_i)$ of the image vectors of $S(NL_i)$.
Formally,

$$D(O_{new}, S(NL_i)) = \| V_{new} - \bar{S}(NL_i) \|$$

Similarly, the distance between two non-leaf entries $NL_i$
and $NL_j$ is defined to be $\| \bar{S}(NL_i) - \bar{S}(NL_j) \|$. Whenever
$|S(NL)| < 2k$, BUBBLE-FM measures distances at NL in
the distance space, as in BUBBLE.
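
A minimal sketch of how a new object might be routed at a non-leaf node in BUBBLE-FM, given its image vector and the image vectors of the sample objects of each entry. The helper names are hypothetical, and the image vectors are assumed to have been produced by FastMap beforehand.

```python
import math
from typing import List, Sequence


def image_centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise mean of the image vectors of one entry's samples."""
    k = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(k)]


def use_image_space(num_samples: int, k: int) -> bool:
    """With fewer than 2k sample objects, fall back to the distance space."""
    return num_samples >= 2 * k


def closest_entry(v_new: Sequence[float],
                  entry_centroids: Sequence[Sequence[float]]) -> int:
    """Index of the entry whose image centroid is nearest to v_new."""
    return min(range(len(entry_centroids)),
               key=lambda i: math.dist(v_new, entry_centroids[i]))
```

Once $V_{new}$ is known, choosing an entry costs no further calls to $d$; the comparison is done entirely on $k$-dimensional vectors.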

5.2.1  An  alternative  at  the  leaf  level 

We do not use FastMap at the leaf levels of the CF*-tree for
the following reasons.

1. Suppose FastMap were used at the leaf levels
also. The approximate image space constructed by
FastMap does not accurately reflect the relative distances
between clustroids; the inaccuracy causes erroneous
insertions of objects into clusters, deteriorating
the clustering quality. Similar errors at non-leaf
levels merely cause new entries to be redirected to
wrong leaf nodes where they will form new clusters.
Therefore, the impact of these errors is on the maintenance
costs of the CF*-tree, but not on the clustering
quality, and hence they are not so severe.

2. If $IS_{fm}(L)$ has to be maintained accurately under
new insertions, then it should be reconstructed
whenever any clustroid in the leaf node L changes.
In this case, the overhead of repeatedly invoking
FastMap offsets the gains due to measuring distances
in $IS_{fm}(L)$.

5.2.2  Image  dimensionality  and  other  parameters 

The  image  dimensionalities  of  non-leaf  nodes  can  be  dif¬ 
ferent  because  the  sample  objects  at  each  non-leaf  node  are 
mapped  into  independent  image  spaces.  The  problem  of 
finding  the  right  dimensionality  of  the  image  space  has  been 
studied  well  [19].  We  set  the  image  dimensionalities  of  all 
non-leaf  nodes  to  the  same  value;  any  technique  used  to  find 
the  right  image  dimensionality  can  be  incorporated  easily 
into  the  mapping  algorithm. 

Our  experience  with  BUBBLE  and  BUBBLE-FM  on 
several  datasets  showed  that  the  results  are  not  very  sensi¬ 
tive  to  small  deviations  in  the  values  of  the  parameters:  the 
representation  number  and  the  sample  size.  We  found  that 
a  value  of  10  for  the  representation  number  works  well  for 
several  datasets  including  those  used  for  the  experimental 
study  in  Section  6.  An  appropriate  value  for  the  sample  size 
depends  on  the  branching  factor  BF  of  the  CF*-tree.  We 
observed that a value of 5 × BF works well in practice.


6.  Performance  Evaluation 

In  this  section,  we  evaluate  BUBBLE  and  BUBBLE- 
FM  on  synthetic  datasets.  Our  studies  show  that  BUB¬ 
BLE  and  BUBBLE-FM  are  scalable  high  quality  clustering 
algorithms.5 

6.1.  Datasets  and  Evaluation  Methodology 

To compare with the Map-First option, we use two
datasets DS1 and DS2. Both DS1 and DS2 have 100000
2-dimensional points distributed in 100 clusters [26]. However,
the cluster centers in DS1 are uniformly distributed on
a 2-dimensional grid; in DS2, the cluster centers are distributed
on a sine wave. These two datasets are also used
to visually observe the clusters produced by BUBBLE and
BUBBLE-FM.

We also generated $k$-dimensional datasets as described
by Agrawal et al. [1]. The $k$-dimensional box $[0, 10]^k$ is divided
into $2^k$ cells by halving the range $[0, 10]$ over each
dimension. A cluster center is randomly placed in each
of $K$ cells chosen randomly from the $2^k$ cells, where $K$
is the number of clusters in the dataset. In each cluster,
$N/K$ points are distributed uniformly within a radius randomly
picked from $[0.5, 1.0]$. A dataset containing $N$ $k$-dimensional
points and $K$ clusters is denoted DSkd.Kc.N.
Even though these datasets consist of $k$-dimensional vectors,
we do not exploit the operations specific to coordinate
spaces, and treat the vectors in the dataset merely as objects.
The distance between any two objects is returned by
the Euclidean distance function.
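
The generation procedure can be sketched as follows. This is our reading of the description, not the generator of [1]: we assume each cluster receives $N/K$ points (rounded down), that "uniformly within a radius" means uniformly inside a $k$-dimensional ball, and that the center is placed uniformly inside its cell. The function name and the seed parameter are hypothetical.

```python
import math
import random
from typing import List, Tuple


def generate_dataset(k: int, num_clusters: int, num_points: int,
                     seed: int = 0
                     ) -> Tuple[List[List[float]], List[int], List[List[float]]]:
    """Return (points, labels, centers) for a DSkd.Kc.N-style dataset."""
    rng = random.Random(seed)
    # Choose K distinct cells out of the 2^k half-range cells.
    cells = rng.sample(range(2 ** k), num_clusters)
    points, labels, centers = [], [], []
    per_cluster = num_points // num_clusters
    for label, cell in enumerate(cells):
        # Bit j of the cell index selects [0, 5] or [5, 10] in dimension j.
        center = [rng.uniform(0.0, 5.0) + 5.0 * ((cell >> j) & 1)
                  for j in range(k)]
        centers.append(center)
        radius = rng.uniform(0.5, 1.0)
        for _ in range(per_cluster):
            # Uniform point in a ball: random direction, radius scaled by U^(1/k).
            direction = [rng.gauss(0.0, 1.0) for _ in range(k)]
            norm = math.sqrt(sum(x * x for x in direction)) or 1.0
            r = radius * rng.random() ** (1.0 / k)
            points.append([c + r * x / norm for c, x in zip(center, direction)])
            labels.append(label)
    return points, labels, centers
```

For example, `generate_dataset(20, 50, 100000)` would produce a dataset of the shape of DS20d.50c.100K, though not the exact points used in the experiments.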

We now describe the evaluation methodology. The
clustroids of the sub-clusters returned by BUBBLE and
BUBBLE-FM are further clustered using a hierarchical
clustering algorithm [20] to obtain the required number
of clusters. To minimize the effect of hierarchical clustering
on the final results, the amount of memory allocated
to the algorithm was adjusted so that the number of
sub-clusters returned by BUBBLE or BUBBLE-FM is very
close (not exceeding the actual number of clusters by more
than 5%) to the actual number of clusters in the synthetic
dataset. Whenever the final cluster is formed by merging
sub-clusters, the clustroid of the final cluster is the centroid
of the clustroids of sub-clusters merged. Other parameters
to the algorithm, the sample size (SS), the branching factor
(B), and the representation number ($2p$), are fixed at
75, 15, and 10 respectively (unless otherwise stated) as they
were found to result in good clustering quality. The image
dimensionality for BUBBLE-FM is set to be equal to the dimensionality
of the data. The dataset $D$ is scanned a second
time to associate each object $O \in D$ with a cluster whose
representative object is closest to $O$.

5 The quality of the result from BIRCH was shown to be independent of
the input order [26]. Since BUBBLE and BUBBLE-FM are instantiations
of the BIRCH* framework, which is abstracted out from BIRCH, we do
not present more results on order-independence here.

We introduce some notation before describing the evaluation
metrics. Let $A_1, \ldots, A_K$ be the actual clusters in the
dataset and $C_1, \ldots, C_K$ be the set of clusters discovered by
BUBBLE or BUBBLE-FM. Let $\bar{A}_i$ ($\bar{C}_j$) be the centroid of
cluster $A_i$ ($C_j$). Let $\hat{C}_j$ be the clustroid of $C_j$. Let $n(C)$ denote
the number of points in the cluster $C$. We use the following
metrics, some of which are traditionally used in the
Statistics and the Pattern Recognition communities [6, 7],
to evaluate the clustering quality and speed.

• The distortion ($\sum_{j=1}^{K} \sum_{X \in C_j} (X - \bar{C}_j)^2$) of a set of
clusters indicates the tightness of the clusters.

• The clustroid quality ($CQ = \frac{\sum_{i=1}^{K} \| \bar{A}_i - \hat{C}_j \|}{K}$) is the average
distance between the actual centroid $\bar{A}_i$ of a cluster $A_i$
and the clustroid $\hat{C}_j$ that is closest to $\bar{A}_i$.

• The number of calls to $d$ (NCD) and the time taken by
the algorithm indicate the cost of the algorithm. NCD
is useful to extrapolate the performance for computationally
expensive distance functions.
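
The first two metrics can be computed as sketched below for vector data. The function names are hypothetical, and the code follows our reading of the definitions above: distortion sums squared Euclidean distances of points to the centroid of their discovered cluster, and CQ averages, over actual centroids, the distance to the nearest discovered clustroid.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def centroid(points: Sequence[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty set of points."""
    k = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(k)]


def distortion(clusters: Sequence[Sequence[Point]]) -> float:
    """Sum of squared distances of each point to its cluster centroid."""
    total = 0.0
    for pts in clusters:
        c = centroid(pts)
        total += sum(math.dist(p, c) ** 2 for p in pts)
    return total


def clustroid_quality(actual_centroids: Sequence[Point],
                      clustroids: Sequence[Point]) -> float:
    """Average distance from each actual centroid to its nearest clustroid."""
    return sum(min(math.dist(a, c) for c in clustroids)
               for a in actual_centroids) / len(actual_centroids)
```

Lower values are better for both metrics; CQ is bounded below by the average distance from each actual centroid to the nearest data point, since a clustroid is always an object of the dataset.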

6.2.  Comparison  with  the  Map-First  option 

We  mapped  DS1,  DS2,  and  DS20d.50c.100K into  an  ap¬ 
propriate  k-dimensional  space  (k  =  2  for  DS1,  DS2,  and 
20  for  DS20d.50c.100K)  using  FastMap,  and  then  used 
BIRCH  to  cluster  the  resulting  k-dimensional  vectors. 

The  clustroids  of  clusters  obtained  from  BUBBLE  and 
BUBBLE-FM  on  DS2  are  shown  in  Figures  1  and  2  respec¬ 
tively,  and  the  centroids  of  clusters  obtained  from  BIRCH 
are  shown  in  Figure  3.  From  the  distortion  values  (Table  1), 
we  see  that  the  quality  of  clusters  obtained  by  BUBBLE  or 
BUBBLE-FM  is  clearly  better  than  the  Map-First  option. 


Dataset           Map-First       BUBBLE     BUBBLE-FM
DS1               195146          129798     122544
DS2               1147830         125093     125094
DS20d.50c.100K    2.214 × 10^6    21127.5    21127.5

Table 1. Comparison with the Map-First option (distortion values)

6.3.  Quality  of  Clustering 

In this section, we use the dataset DS20d.50c.100K. To
place the results in the proper perspective, we mention that
the average distance between the centroid of each cluster
$A_i$ and the actual point in the dataset closest to that centroid
is 0.212. Hence the clustroid quality (CQ) cannot be less than 0.212.
From Table 2, we observe that the CQ values are close to the
minimum possible value (0.212), and the distortion values
match almost exactly. Also, we observed that all the points
except a few (fewer than 5) were placed in the appropriate
clusters.

Algorithm     CQ       Actual Distortion    Computed Distortion
BUBBLE        0.289    21127.4              21127.5
BUBBLE-FM     0.294    21127.4              21127.5

Table 2. Clustering Quality

Figure 1. DS2: BUBBLE

Figure 2. DS2: BUBBLE-FM

Figure 3. DS2: BIRCH

Figure 4. Time vs #points

Figure 5. NCD vs #points

Figure 6. Time vs #clusters

6.4.  Scalability 

To  study  scalability  characteristics  with  respect  to  the 
number  of  points  in  the  dataset,  we  fixed  the  number  of  clus¬ 
ters  at  50  and  varied  the  number  of  data  points  from  50000 
to  500000  (i.e.,  DS20d.50c.*). 

Figures 4 and 5 plot the time and NCD values for BUBBLE
and BUBBLE-FM as the number of points is increased.
We make the following observations. (i) Both algorithms
scale linearly with the number of points, as expected.
(ii) BUBBLE consistently outperforms BUBBLE-FM in
running time. This is due to the overhead of FastMap in
BUBBLE-FM. (The distance function in the fastmapped
space as well as the original space is the Euclidean distance
function.) However, the constant difference between their
running times suggests that the overhead due to the use of
FastMap remains constant even though the number of points
increases. The difference is constant because the overhead
due to FastMap is incurred only when the nodes in the
CF*-tree split. Once the distribution of clusters is captured, the
nodes do not split that often any more. (iii) As expected,
BUBBLE-FM has smaller NCD values. Since the overhead
due to the use of FastMap remains constant, the difference
between the NCD values increases as the number of points
is increased.

To study scalability with respect to the number of clusters,
we varied the number of clusters between 50 and 250
while keeping the number of points constant at 200000. The
results are shown in Figure 6. The plot of time versus number
of clusters is almost linear.6

7.  Data  Cleaning  Application 

When different bibliographic databases are integrated,
different conventions for recording bibliographic items such
as author names and affiliations cause problems. Users familiar
with one set of conventions will expect their usual
forms to retrieve relevant information from the entire collection
when searching. Therefore, a necessary part of the
integration is the creation of a joint authority file [2, 15] in
which classes of equivalent strings are maintained. These
equivalence classes can be assigned a canonical form. The
process of reconciling variant string forms ultimately requires
domain knowledge and inevitably a human in the
loop, but it can be significantly sped up by first achieving
a rough clustering using a metric such as the edit distance.
Grouping closely related entries into initial clusters that act
as representative strings has two benefits: (1) Early aggregation
acts as a “sorting” step that lets us use more aggressive
strategies in later stages with less risk of erroneously
separating closely related strings. (2) If an error is made
in the placement of a representative, only that representative
need be moved to a new location. Also, even a small
reduction in the data size is valuable, given the cost of the
subsequent detailed analysis involving a domain expert.7

6 NCD versus number of clusters is in the full paper [16].

7 Examples and more details are given in the full paper.

Applying edit distance techniques to obtain such a “first
pass” clustering is quite expensive, however, and we therefore
applied BUBBLE-FM to this problem. We view this
application as a form of data cleaning because a large number
of closely related strings, differing from each other by
omissions, additions, and transpositions of characters and
words, are placed together in a single cluster. Moreover, it
is preparatory to more detailed domain-specific analysis involving
a domain expert. We compared BUBBLE-FM with
some other clustering approaches [14, 15], which use relative
edit distance (RED). Our results are very promising and
indicate that BUBBLE-FM achieves high quality in much
less time.
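
To give a sense of why each call to the distance function is costly on strings, the following is a standard dynamic-programming sketch of the plain edit (Levenshtein) distance. It is included only as an illustration; it is not the relative edit distance of [14, 15], whose definition is not given here, and it counts a transposition as two operations.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]
```

Each call takes time proportional to the product of the two string lengths, so reducing the number of calls, as BUBBLE-FM does at non-leaf nodes, directly reduces running time.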

We used BUBBLE-FM on a real-life dataset RDS of
about 150,000 strings (representing 13,884 different variants)
to determine the behavior of BUBBLE-FM. Table 3
shows our results on the dataset RDS. A string is said to
be misplaced if it is placed in the wrong cluster. Since we
know the exact set of clusters, we can count the number of
misplaced strings. We first note that BUBBLE-FM is much
faster than RED. Moreover, more than 50% of the time is
spent in the second phase, where each string in the dataset is
associated with a cluster. Second, parameters in BUBBLE-FM
can be set according to the tolerance on misclassification
error. If the tolerance is low, then BUBBLE-FM returns
a much larger number of clusters than RED, but the
misclassification is much lower too. If the tolerance is high, then it
returns a lower number of clusters with higher misclassification
error.


Algorithm             # of clusters    # of misplaced strings    Time (in hrs)
RED (run 1)           10161            69                        45
BUBBLE-FM (run 1)     10078            897                       7.5
BUBBLE-FM (run 2)     12385            20                        7

Table 3. Results on the dataset RDS

8. Conclusions

In  this  paper,  we  studied  the  problem  of  clustering  large 
datasets  in  arbitrary  metric  spaces.  The  main  contributions 
of  this  paper  are: 

1.  We  introduced  the  BIRCH*  framework  for  fast  scal¬ 
able  incremental  pre-clustering  algorithms  and  in¬ 
stantiated  BUBBLE  and  BUBBLE-FM  for  clustering 
data  in  a  distance  space. 

2. We introduced the concept of image space to generalize
the definitions of summary statistics such as centroid
and radius to distance spaces.

3.  We  showed  how  to  reduce  the  number  of  calls  to  an 
expensive  distance  function  by  using  FastMap  with¬ 
out  deteriorating  the  clustering  quality. 